Class 24

DATA1220-55, Fall 2024

Sarah E. Grabinski

2024-10-30

Population Parameters versus Sample Statistics

Sample statistics are used to estimate unknowable population parameters
Measure | Sample Statistic | Population Parameter
Mean | \(\bar{x}\) | \(\mu\)
Proportion | \(\hat{p}\) | \(p\)
Difference in Means | \(\bar{x}_1-\bar{x}_2\) | \(\mu_1-\mu_2\)
Difference in Proportions | \(\hat{p}_1 - \hat{p}_2\) | \(p_1 - p_2\)
Standard Deviation | \(s\) | \(\sigma\)

Assumptions

  • Data is reliable and valid
    • \(\bar{x}_{\operatorname{observed}} \approx \bar{x}_{\operatorname{expected}}\) or \(\hat{p}_{\operatorname{observed}} \approx \hat{p}_{\operatorname{expected}}\) when data is reliable
    • \(\bar{x}_{\operatorname{observed}} \approx \mu\) or \(\hat{p}_{\operatorname{observed}} \approx p\) when data is valid
  • Observations are independent and identically distributed
  • Sufficient sample size

    • \(n \ge 30\) for \(\bar{x}\) (means)

    • \(n \ge 20\), \(n_{x=1} \ge 10\), & \(n_{x=0} \ge 10\) for \(\hat{p}\) (proportions)

  • For means, the observed distribution in the sample approximates a normal distribution (less strict as \(n \to \infty\))
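
The sample-size rules of thumb above can be sketched as two small checks. This is only an illustration: the function names `mean_sample_size_ok` and `proportion_sample_size_ok` are hypothetical, and the checks cover sample size alone, not reliability, validity, or independence.

```python
def mean_sample_size_ok(n: int) -> bool:
    """Rule of thumb from the list above: n >= 30 for a sample mean."""
    return n >= 30


def proportion_sample_size_ok(successes: int, failures: int) -> bool:
    """Rule of thumb from the list above: n >= 20, with at least 10
    observations where x = 1 (successes) and 10 where x = 0 (failures)."""
    n = successes + failures
    return n >= 20 and successes >= 10 and failures >= 10


print(mean_sample_size_ok(50))            # sample of 50 for a mean
print(proportion_sample_size_ok(16, 14))  # 16 ones and 14 zeros
print(proportion_sample_size_ok(25, 5))   # only 5 zeros
```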

The Central Limit Theorem

The distribution of the sample statistic \(\bar{x}\) or \(\hat{p}\) approximates the normal distribution \(N\left(\text{population parameter}, \text{standard error}\right)\) as \(n \to \infty\).

  • \(\bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)\)

  • \(\hat{p} \sim N\left(p, \sqrt{\frac{p (1-p)}{n}}\right)\)

The sampling distribution is approximately normal, centered at the population parameter (which we estimate with the sample statistic), with standard deviation equal to the standard error.
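
A small simulation can make the theorem concrete. The sketch below assumes a Uniform(0, 1) population, chosen only because it is clearly not normal; the seed, the sample size of 50, and the 5,000 repetitions are likewise arbitrary choices, with the repetitions standing in for sampling infinitely many times.

```python
import math
import random
import statistics

random.seed(1220)  # arbitrary seed so the run is repeatable

n = 50       # observations per sample
reps = 5000  # number of repeated samples

# Assumed population: Uniform(0, 1), so mu = 0.5 and sigma = sqrt(1/12)
mu = 0.5
sigma = math.sqrt(1 / 12)
se = sigma / math.sqrt(n)

# One sample mean per repeated sample
sample_means = [
    statistics.fmean(random.random() for _ in range(n))
    for _ in range(reps)
]

# Under N(mu, se), about 95% of sample means fall within 1.96 SE of mu
within = sum(abs(m - mu) <= 1.96 * se for m in sample_means) / reps

print(f"mean of sample means: {statistics.fmean(sample_means):.4f} (mu = {mu})")
print(f"sd of sample means:   {statistics.stdev(sample_means):.4f} (SE = {se:.4f})")
print(f"share within 1.96 SE: {within:.3f}")
```

If the approximation holds, the mean of the sample means should sit near \(\mu\), their standard deviation near the standard error, and the share within 1.96 standard errors near 0.95.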

Standard Error of Sample Mean \(\bar{x}\)

The standard deviation of the sampling distribution of \(\bar{x}\) is the population standard deviation \(\sigma\) divided by the square root of the size of the sample \(n\).

\[ \begin{aligned} SE_{\bar{x}} &= \frac{\sigma}{\sqrt{n}} \end{aligned} \]

Because we don’t have access to the “true” value of \(\sigma\), we substitute the observed standard deviation in the sample \(s\) for inference and hypothesis testing.

\[ \begin{aligned} SE_{\bar{x}} &= \frac{s}{\sqrt{n}} \end{aligned} \]
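
As a minimal sketch of the plug-in formula, the function below computes \(s/\sqrt{n}\) from a list of observations. The function name `standard_error_mean` and the data values are made up for illustration.

```python
import math
import statistics


def standard_error_mean(sample):
    """Estimate the standard error of the sample mean as s / sqrt(n)."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(n)


values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data for illustration
print(round(standard_error_mean(values), 3))
```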

Population Distribution of \(x\)

The population in this figure has the “true” parameters of mean \(\mu=0\) and standard deviation \(\sigma=1\).

Sampling Distribution for \(\bar{x}\)

The sampling distribution is the distribution of the sample statistic \(\bar{x}\) across samples of size \(n\) taken from the population \(x \sim N(0, 1)\), if you were to sample infinitely many times.

Side-By-Side

Distribution of \(\bar{x}\) versus \(x\)

Observed values of \(x\) are more variable than observed values of \(\bar{x}\).

Example

  • The “true” distribution in your population is normal with mean \(\mu = 0\) and standard deviation \(\sigma = 1\) (\(x \sim N(0, 1)\))
  • Take repeated samples of \(n = 50\) from the population \(x \sim N(0, 1)\).
  • Calculate \(\bar{x}_i\) for each sample of size \(n\).
  • Compare the observed distribution of \(\bar{x}_i\) to the expected distribution \(\bar{x} \sim N\left(0, \frac{1}{\sqrt{n}} \right)\).
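
The steps above can be sketched as a short simulation. It is not the code behind the figures in these slides; the seed and the 5,000 repetitions are assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(24)  # arbitrary seed so the run is repeatable

n = 50       # sample size from the example
reps = 5000  # assumed number of repeated samples

# Draw repeated samples from N(0, 1) and record each sample mean
x_bars = []
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]
    x_bars.append(statistics.fmean(sample))

expected_se = 1 / math.sqrt(n)  # sigma / sqrt(n) with sigma = 1

print(f"mean of x_bar: {statistics.fmean(x_bars):.4f} (expected 0)")
print(f"sd of x_bar:   {statistics.stdev(x_bars):.4f} (expected {expected_se:.4f})")
```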

Observed Distributions of \(x\) in Samples

Observed Sample Means

Sample ID | \(\bar{x}\) | \(s\)
1 | -0.047 | 0.981
2 | 0.178 | 0.835
3 | 0.024 | 0.870
4 | -0.094 | 0.907
5 | -0.184 | 1.148
6 | 0.154 | 0.942
7 | 0.015 | 0.967
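
As a worked check of the plug-in formula, take Sample 1 and assume the sample size \(n = 50\) from the example:

\[ \begin{aligned} SE_{\bar{x}} &= \frac{s}{\sqrt{n}} = \frac{0.981}{\sqrt{50}} \approx 0.139 \end{aligned} \]

This is close to the value \(\frac{1}{\sqrt{50}} \approx 0.141\) implied by the “true” \(\sigma = 1\).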

Observed Distributions vs Expected Distribution

Standard Error of Sample Proportion \(\hat{p}\)

The standard deviation of the sampling distribution of \(\hat{p}\) for sample size \(n\) is…

\[ \begin{aligned} SE_{\hat{p}} &= \sqrt{\frac{p(1-p)}{n}} \end{aligned} \]

Because we don’t have access to the “true” value of \(p\), we substitute the observed statistic in the sample \(\hat{p}\) for inference and hypothesis testing.

\[ \begin{aligned} SE_{\hat{p}} &= \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \end{aligned} \]
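
A minimal sketch of the plug-in formula follows. The function name `standard_error_proportion` and the input values are hypothetical.

```python
import math


def standard_error_proportion(p_hat: float, n: int) -> float:
    """Estimate the standard error of a sample proportion."""
    return math.sqrt(p_hat * (1 - p_hat) / n)


# Made-up inputs: an observed proportion of 0.5 in a sample of 100
print(round(standard_error_proportion(0.5, 100), 3))
```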

Example

  • The “true” proportion in your population is \(p=0.5\)
  • Take repeated samples of \(n = 30\) from the population \(p=0.5\)
  • Calculate \(\hat{p}_i\) for each sample of size \(n\).
  • Compare the observed distribution of \(\hat{p}_i\) to the expected distribution \(\hat{p} \sim N\left(0.5, \sqrt{\frac{0.5(1-0.5)}{n}} \right)\).
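
The steps above can be sketched as a simulation. It is not the code behind the figures in these slides: the helper `simulate_p_hats`, the seed, and the 5,000 repetitions are assumptions, and the comparison is run at two illustrative sample sizes (30 and 50) to show how the standard error shrinks as \(n\) grows.

```python
import math
import random
import statistics

random.seed(2024)  # arbitrary seed so the run is repeatable

p = 0.5      # "true" population proportion from the example
reps = 5000  # assumed number of repeated samples


def simulate_p_hats(n, p, reps):
    """Draw `reps` samples of n binary observations; return each p_hat."""
    return [
        sum(random.random() < p for _ in range(n)) / n
        for _ in range(reps)
    ]


results = {}
for n in (30, 50):  # illustrative sample sizes
    p_hats = simulate_p_hats(n, p, reps)
    expected_se = math.sqrt(p * (1 - p) / n)
    results[n] = (statistics.fmean(p_hats), statistics.stdev(p_hats), expected_se)
    print(f"n = {n}: mean of p_hat = {results[n][0]:.4f}, "
          f"sd of p_hat = {results[n][1]:.4f}, expected SE = {expected_se:.4f}")
```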

Observed Distributions of \(\hat{p}\) in Samples

Observed Sample Proportions

Sample ID | \(\hat{p}\) | \(SE_{\hat{p}}\)
1 | 0.533 | 0.091
2 | 0.667 | 0.086
3 | 0.400 | 0.089
4 | 0.500 | 0.091
5 | 0.433 | 0.090
6 | 0.600 | 0.089
7 | 0.667 | 0.086

Observed Distributions vs Expected Distribution